Signal Recovery in Unions of Subspaces with Applications to Compressive Imaging

Nikhil Rao†, Benjamin Recht* and Robert D. Nowak†
* Computer Sciences Department, University of Wisconsin-Madison
† Electrical and Computer Engineering, University of Wisconsin-Madison

September 2012 
Abstract 

In applications ranging from communications to genetics, signals can be modeled as lying in 
a union of subspaces. Under this model, signal coefficients that lie in certain subspaces are active 
or inactive together. The potential subspaces are known in advance, but the particular set of 
subspaces that are active (i.e., in the signal support) must be learned from measurements. We 
show that exploiting knowledge of subspaces can further reduce the number of measurements 
required for exact signal recovery, and derive universal bounds for the number of measurements 
needed. The bound is universal in the sense that it only depends on the number of subspaces 
under consideration, and their orientation relative to each other. The particulars of the sub-
spaces (e.g., compositions, dimensions, extents, overlaps, etc.) do not affect the results we
obtain. In the process, we derive sample complexity bounds for the special case of the group
lasso with overlapping groups (the latent group lasso), which is used in a variety of applications. 
Finally, we also show that wavelet transform coefficients of images can be modeled as lying in 
groups, and hence can be efficiently recovered using group lasso methods. 

Keywords. Union of Subspaces, Group Sparsity, Convex Optimization, Structured Sparsity, 
Compressed Sensing 



1 Introduction 

In many fields such as genetics, image processing, and machine learning, one is faced with the task 
of recovering very high dimensional signals from relatively few measurements. In general this is not 
possible, but fortunately many real world signals are, or can be transformed to be, sparse, meaning 
that only a small fraction of signal coefficients is non-zero. Compressed Sensing [5, 10] allows us to
recover sparse, high dimensional signals with very few measurements as compared to the ambient
signal dimension. In fact, results indicate that one only needs $O(s \log p)$ random measurements
to exactly recover an $s$-sparse signal of length $p$.

In many applications however, one not only has knowledge about the sparsity of the signal, but 
some additional information about the structure of the sparsity pattern as well: 

1. In genetics, the genes are arranged into pathways/clusters, and genes belonging to the same 
pathway are often active/inactive in a group [3~H|37). 



2. In image processing, the wavelet transform coefficients can be modeled as belonging to a tree,
with parent-child coefficients simultaneously being large or small [8, 34].

3. In wideband spectrum sensing applications, the spectrum typically displays clusters of non-
zero frequency coefficients, each corresponding to a narrowband transmission [26].

4. In applications in analog compressed sensing [15, 16], the signals can be expressed as lying in
a union of shift invariant subspaces.

5. In reconstruction of signals having a finite rate of innovation [11, 40], non zeros are known to
be clustered spatially, corresponding to objects in a scene for example.

In cases such as these, the sparsity pattern can be represented as lying in a union of certain
subspaces (e.g., coefficients in certain pathways, tree branches, frequency bands, or clusters). This
knowledge about the signal structure can help further reduce the number of measurements one
needs to exactly recover the signal. In this paper, we derive bounds on the number of random
i.i.d. Gaussian measurements needed to exactly recover a sparse signal when its pattern of sparsity
lies in a union of subspaces, based on solving a convex program. This characterization
specializes to the latent group lasso, introduced in [20, 30, 32], wherein the sparsity pattern can be
expressed as lying in a union of groups.

We analyze the recovery problem using a random Gaussian measurement model. We emphasize 
that although the derivation assumes the measurement matrix to be Gaussian, it can be extended 
to any subgaussian case, by paying a small constant penalty, as shown in [25]. We restrict ourselves 
to the Gaussian case here since it highlights the main ideas and keeps the analysis as simple as 
possible. 



1.1 Prior Work 

To the best of our knowledge, these results are new and distinct from prior theoretical
characterizations of group lasso and the general union of subspace methods. Sampling theorems for unions
of subspaces have been considered in [4, 24], where the authors show that the sample complexity
depends logarithmically on the number of subspaces under consideration. In [14], the authors also
propose a greedy scheme to recover signals that lie in such unions. The authors in [19] derive
information theoretic bounds for the number of measurements needed for a variety of signal ensembles,
including trees. In [2, 12], the authors show that one needs far fewer measurements when the signal
can be expressed as lying in a union of subspaces, and explicit bounds are derived when using a
modified version of CoSaMP to recover the signal. Asymptotic consistency results are derived
for the group lasso [42] when the groups partition the space of variables in [1]. Similarly, in [18],
the authors again consider the groups to partition the space, and derive conditions for recovery
using the group lasso. In [21, 22], the authors derive consistency results for the group lasso under
arbitrary groupings of variables. Overlapping groups have also been considered, with sample
bounds derived for that setting, and consistency results for the group lasso with overlap have been
obtained in an asymptotic setting, but without exact recovery results. The group lasso with overlap
(called the latent group lasso in [30]) is analyzed in detail in [30, 32].



Non-greedy schemes have not been developed to handle the case of the union of subspaces, nor
the latent group lasso. Although the group lasso with overlapping groups has been considered in the
past, it yields vectors whose support can be expressed as a complement of a union of groups, while
we consider cases where we require the support to lie in a union of groups, a distinction made
in [20]. In the applications considered above, however, it is imperative that we recover a sparsity
pattern that lies in a union of groups (or subspaces in more generality).

1.2 Our contributions



To derive our results, we appeal to the notion of restricted minimum singular values of an operator. 
The restricted minimum singular value (or equivalently the restricted eigenvalues of the Gram
matrix of the operator) conditions are weaker than the well known Restricted Isometry Conditions
of operators, and have been used and studied in [3, 38], among others.



We bound the number of measurements needed for exact recovery with two terms. One term (kB)
grows linearly in the total number of non-zero coefficients (with a constant of proportionality). This 
is close to the bare minimum of one measurement per non-zero component. Intuitively, this term 
corresponds to the magnitude estimation once the locations of the non zero components have been 
determined. The other term logarithmically depends on the number of subspaces under consider- 
ation and their relative orientations, and not the particulars of the subspaces (e.g., compositions, 
dimensions, extents, etc.). In particular, the subspaces need not be disjoint. Intuitively, this term 
corresponds to the price we pay for the detection of the non zeros in the signal (the active sub- 
spaces). The degree to which subspaces overlap, remarkably, has no effect on our bounds. In this 
regard, our bounds can be termed to be universal. This is somewhat surprising since overlapping 
subspaces are strongly coupled in the observations, tempting one to suppose that overlap may make 
recovery more challenging. 

Our main theoretical result shows that for signals with support on $k$ of $M$ possible subspaces,
exact recovery is possible from $C_1 (\sqrt{2\log(M-k)} + \sqrt{B})^2 k + C_2 kB$ measurements using a latent
group lasso type algorithm, $B$ being the maximum subspace dimension. The constants $C_1$ and $C_2$
depend on the relative angle between subspaces, and will be explicitly derived in the sequel. Note
that the bound depends on the sparsity $s$ of the signal via the $kB$ term. The latent group lasso
reduces to a special case of our result. We will routinely compare the performance of the group lasso
to the standard lasso, to study the effects of overlap between subspaces on the actual number of
measurements needed to exactly recover a signal. For the lasso bound, we will use the one derived
in [6]: $(2s + 1)\log(p - s)$. Assuming that $M = O(\mathrm{poly}(p))$, our bound is roughly $k\log(p) + kB$. For
the same problems, the lasso, which ignores the group structure of the sparse signal components,
would require approximately $kB\log(p)$ measurements. Hence, taking advantage of the subspace
structure will allow us to take fewer measurements to reconstruct the signal.

Note that in this work, the subspaces can be arbitrary, and we make no assumptions about their 
nature, except that they are known in advance. In short, we derive bounds for any generic union of 
subspaces, whether they overlap or form a partition of the ambient high dimensional space. Note 
that, when we say "do not overlap", we mean that the intersection of the subspaces is {0}, the all 
zeros vector. 

We then propose a novel way to model wavelet coefficients of images in this framework, and 
show that we perform at least as well as several other state of the art methods in compressive 
imaging. 

To summarize, our contributions in this paper are as follows: 

1. We derive non asymptotic sample complexity bounds in a compressive-sensing framework 
when the measurement matrix is i.i.d. Gaussian, and the signal can be expressed as lying in 
a sparse union of finite dimensional subspaces. 

2. We show that our bound holds regardless of the nature of overlap between subspaces. In this 
sense, the bounds we derive are universal. 

3. We show that the group lasso with overlapping groups is a special case of the general result
that we derive.

4. We propose a new method to model wavelet coefficients of signals, so that one can use convex 
optimization algorithms to recover the signal exactly, or with high fidelity in the presence of 
noise. 

5. An extensive series of experiments verify the theory. 

The rest of this paper is organized as follows: In Section 2 we set up the problem of recovering a
signal lying in a union of subspaces, and present some preliminaries that will be useful in deriving
the bounds. Our result for the sample complexity for the union of subspaces is derived in Section
3. We then extend the framework for the case of group lasso with overlapping groups in Section
4. Section 5 extends our results to approximately sparse signals. In Section 6 we propose a novel
framework for modeling wavelet transform coefficients of images, that makes use of the concepts we
present in the sections that precede it. Experimental validation is provided in Section 7. Finally,
we conclude our paper in Section 8 and present avenues for future work.

2 The Union of Subspaces Model 

In the course of the next sections, we derive measurement bounds for the exact recovery of a 
signal lying in a sparse union of finite dimensional subspaces, and the robust recovery of one that 
is approximately sparse. In this section, we will argue as to why exact recovery of the signal 
corresponds to the minimization of the atomic norm of the signal, with the atoms obeying certain 
properties governed by the signal structure. Before we do so, we dispense with the notations. 

2.1 Notations 

Consider a signal of length $p$ that is $s$-sparse. Note here that in case of multidimensional signals
like images, we assume they are vectorized to have length $p$.
Suppose we are given a set of bases for $M$ subspaces

$$\mathcal{K} = \{K_1, K_2, \ldots, K_M\}$$

Let the dimensions of each subspace be given by $d_1, d_2, \ldots, d_M$, with $B = \max_i d_i$. So, $K_i \in \mathbb{R}^{p \times d_i}$.
We assume that $\mathrm{span}[K_1 K_2 \ldots K_M] = \mathbb{R}^p$. Without loss of generality, assume each $K_i$ to be
orthonormal. If not, we can perform the Gram-Schmidt procedure to orthonormalize them. The
subspaces can be overlapping or non overlapping. When we say two subspaces do not overlap, we
mean that

$$\mathrm{span}(K_i) \cap \mathrm{span}(K_j) = \{0\}$$

Also, the subspaces may or may not be perpendicular to each other. Two subspaces are
perpendicular if

$$|\langle k_a, k_b \rangle| = 0 \quad \forall k_a \in A_i \text{ and } \forall k_b \in A_j$$

where

$$A_i = \mathrm{span}(K_i) \setminus (\mathrm{span}(K_i) \cap \mathrm{span}(K_j))$$

and

$$A_j = \mathrm{span}(K_j) \setminus (\mathrm{span}(K_i) \cap \mathrm{span}(K_j))$$

The notion of subspaces being perpendicular to each other is clarified in Fig. 1. The more
perpendicular the subspaces are to each other, the more separated they are,
and hence the easier it is to distinguish which of the subspaces are active.




Figure 1: Three subspaces, commonly encountered in subspace clustering methods. Subspaces S2
and S3 are perpendicular to each other. However, S1 and S2 are not perpendicular, nor are S1 and
S3.

We denote vectors by bold lowercase letters (a, v etc.), and matrices by bold uppercase letters
(M, etc.). Subscripts following a vector denote a particular index of the vector. For any vector
$v$, we will routinely use the following decomposition:

$$v = \sum_{i=1}^{M} K_i v^i$$

where $v^i \in \mathbb{R}^{d_i}$. The decomposition holds since we assume $\mathrm{span}([K_1 K_2 \ldots K_M]) = \mathbb{R}^p$.
Superscripts following a matrix will denote the sub matrix whose columns are the columns indexed by
the superscript. $|\cdot|$ denotes the cardinality of a set.

We let $x^*$ be the (subspace sparse) signal to be recovered, whose non zero coefficients lie in $k$
of the $M$ subspaces $\mathcal{K}^* \subset \mathcal{K}$, with $k \ll M$. Formally, noting that $x^* = \sum_i K_i x^{*i}$,

$$\mathcal{K}^* = \{K_i \in \mathcal{K} : \|x^{*i}\| \neq 0\}$$

We then have $|\mathcal{K}^*| = k$. Let the indices of the active subspaces be given by $J$. That is,

$$J \subset \{1, 2, \ldots, M\} : \{j \in J \iff K_j \in \mathcal{K}^*\}$$

Later in the paper, we will also consider approximately sparse signals. We let $\Phi \in \mathbb{R}^{n \times p}$ be a
measurement matrix consisting of i.i.d. Gaussian entries of mean 0 and variance $\frac{1}{n}$, so that every column
is a realization of an i.i.d. Gaussian length $n$ vector with covariance matrix $\frac{1}{n} I$. We denote the
observed vector by $y \in \mathbb{R}^n$ : $y = \Phi x^*$. The absence of a subscript following a norm $\|\cdot\|$ implies
the $\ell_2$ norm. The dual norm of $\|\cdot\|_p$ is denoted by $\|\cdot\|_p^*$. The convex hull of a set of points $S$ is
denoted by $\mathrm{conv}(S)$. We let $\sigma_i(M)$ denote the $i$-th singular value of a matrix $M$. We define

$$K^* = [K_{j_1} K_{j_2} \ldots], \quad j_1, j_2, \ldots \in J$$

and

$$K = [K_1 K_2 \ldots K_M]$$

Finally, let $\kappa(K)$ be the condition number of $K$.

2.2 Atoms, Atomic Set and the Atomic Norm 

To begin with, let us formalize the notion of atoms and the atomic norm of a signal (or vector).
We will restrict our attention to signals in $\mathbb{R}^p$ that can be expressed as lying in a sparse union of
subspaces, though the same concepts can be extended to other spaces as well. We assume that
$x \in \mathbb{R}^p$ can be decomposed as:

$$x = \sum_{i} c_i a^i, \quad c_i \ge 0$$

The vectors $a^i \in \mathbb{R}^p$ are called atoms, and form the basic building blocks of any signal, which can
be represented as a conic combination of the atoms. Note that the sum notation, rather than the
integral notation, implies that only a countable number of coefficients can be non-zero. We denote
$\mathcal{A} = \{a\}$ to be the atomic set. Given a vector $x \in \mathbb{R}^p$ and an atomic set, we define the atomic
norm as

$$\|x\|_{\mathcal{A}} = \inf\left\{ \sum_{a \in \mathcal{A}} c_a : x = \sum_{a \in \mathcal{A}} c_a a, \ c_a \ge 0 \ \forall a \in \mathcal{A} \right\} \tag{2}$$



The atomic decomposition of the signal yields a representation of a signal in terms of some
predefined atoms. Usually, few atoms used in a representation indicates a "simpler" representation.
Hence, to obtain a "simple" representation of a vector, we look to minimize the atomic norm subject
to constraints

$$\hat{x} = \arg\min \|x\|_{\mathcal{A}} \quad \text{s.t.} \quad y = \Phi x \tag{3}$$

When the measurements are corrupted with noise $\theta$, such that $\|\theta\| \le \epsilon$, the atomic norm
minimization problem becomes:

$$\hat{x} = \arg\min \|x\|_{\mathcal{A}} \quad \text{s.t.} \quad \|y - \Phi x\|^2 \le \epsilon^2 \tag{4}$$

Indeed, when the atoms are merely the canonical basis in $\mathbb{R}^p$, the atomic norm reduces to the
standard $\ell_1$ norm, and minimization of the atomic norm yields the well known lasso procedure [39].

Assuming we are aware of the subspaces $\mathcal{K}$, we now proceed to define the atomic set and the
corresponding atomic norm for our framework. Let

$$\mathcal{A}_i = \{a \in \mathbb{R}^p : \exists \alpha^i : a = K_i \alpha^i, \ \|a\| = \|\alpha^i\| = 1\}$$

$$\mathcal{A} = \cup_{i=1}^{M} \mathcal{A}_i \tag{5}$$

The sub vectors $a \in \mathcal{A}_i$ form the boundary of the unit sphere restricted to the span of $K_i$.
The atomic norm of a vector $x \in \mathbb{R}^p$ is given by

$$\|x\|_{\mathcal{A}} = \min_{x = \sum_i c_i a_i} \sum_i c_i, \quad c_i \ge 0 \tag{6}$$

Lemma 2.1 The atomic norm for a signal $x \in \mathbb{R}^p$ lying in a union of subspaces is given by

$$\|x\|_{\mathcal{A}} = \min_{x = \sum_i K_i \alpha^i} \sum_i \|\alpha^i\|$$

Proof The result follows by substituting

$$c_i = \|K_i \alpha^i\| = \|\alpha^i\|$$

and

$$a_i = \frac{K_i \alpha^i}{\|\alpha^i\|}$$

The notion of the atomic set and the corresponding atomic norm for the union of subspaces
model is made clear in Fig. 2. The figure shows the atomic norm balls, given by the convex hull of
the atomic set $\mathcal{A}$.

(a) Norm ball for perpendicular subspaces   (b) Norm ball for non-perpendicular subspaces

Figure 2: Atomic norm balls for a pair of perpendicular and non-perpendicular subspaces. The
atomic set corresponds to the union of boundaries of the unit disks. Each disk corresponds to a
particular $\mathcal{A}_i$ in (5).



Also note that we can directly compute the dual of the atomic norm from the set of atoms

$$\|x\|_{\mathcal{A}}^* = \sup_{a \in \mathcal{A}} \langle x, a \rangle = \max_{i = 1, 2, \ldots, M} \|(K_i)^T x\| \tag{7}$$

That is, the dual norm is the maximum over the norms of the projections of $x$ onto the different
subspaces, noting that $[(K_i)^T K_i]^{-1} = I$. The dual norm will be useful in our derivations below.
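
As a minimal sketch of the dual norm in (7), the snippet below evaluates $\max_i \|K_i^T x\|$ for a small hand-built example. The function name `dual_atomic_norm`, the list-of-columns representation of each basis, and the two example subspaces in $\mathbb{R}^3$ are assumptions made for illustration only; they do not come from the paper.

```python
import math

def dual_atomic_norm(bases, x):
    """Return max_i ||K_i^T x||_2, where each basis K_i is given as a list
    of orthonormal column vectors (each a list of length p)."""
    best = 0.0
    for columns in bases:
        # K_i^T x has one entry per column k of K_i: the inner product <k, x>.
        coeffs = [sum(kj * xj for kj, xj in zip(col, x)) for col in columns]
        best = max(best, math.sqrt(sum(c * c for c in coeffs)))
    return best

# Hypothetical example in R^3: K_1 spans the (e1, e2) plane, K_2 spans e3.
K1 = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
K2 = [[0.0, 0.0, 1.0]]
x = [3.0, 4.0, 1.0]
dual = dual_atomic_norm([K1, K2], x)  # max(||(3, 4)||, |1|)
```

Because each $K_i$ is assumed orthonormal, the norm of $K_i^T x$ equals the norm of the projection of $x$ onto the span of $K_i$, which is why no matrix inverse appears.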

2.3 Gaussian Widths and Exact Recovery 

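Following [6], we define the tangent cone and normal cone at $x^*$ with respect to $\mathrm{conv}(\mathcal{A})$ under
$\|x\|_{\mathcal{A}}$ as [33]:

$$T_{\mathcal{A}}(x^*) = \mathrm{cone}\{z - x^* : \|z\|_{\mathcal{A}} \le \|x^*\|_{\mathcal{A}}\} \tag{8}$$

$$N_{\mathcal{A}}(x^*) = \{u : \langle u, z \rangle \le 0 \ \forall z \in T_{\mathcal{A}}(x^*)\} \tag{9}$$

$$\phantom{N_{\mathcal{A}}(x^*)} = \{u : \langle u, x^* \rangle = \gamma \|x^*\|_{\mathcal{A}} \text{ and } \|u\|_{\mathcal{A}}^* \le \gamma \text{ for some } \gamma \ge 0\}$$

We note that, from [6] (Prop. 2.1), $\hat{x} = x^*$ in (3) is unique iff

$$\mathrm{null}(\Phi) \cap T_{\mathcal{A}}(x^*) = \{0\} \tag{10}$$

Hence, we require that the tangent cone at $x^*$ intersects the nullspace of $\Phi$ only at the origin, to
guarantee exact recovery.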

Before we state the main recovery result from [6], we define the Gaussian width of a set:

Definition 2.2 Let $S^{p-1}$ denote the unit sphere in $\mathbb{R}^p$. The Gaussian width $\omega(S)$ of a set $S \subset S^{p-1}$
is

$$\omega(S) = E_g\left[\sup_{z \in S} g^T z\right]$$

where $g \sim \mathcal{N}(0, I)$ is an i.i.d. standard Gaussian vector.
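
To make the definition concrete, here is a small Monte Carlo sketch that estimates the Gaussian width of a finite set of unit vectors. The function name `gaussian_width`, the trial count, the seed and the two-point example set are illustrative assumptions, not part of the paper.

```python
import math
import random

def gaussian_width(points, trials=20000, seed=0):
    """Monte Carlo estimate of E_g[ sup_{z in points} <g, z> ] for a finite
    set of unit vectors, with g a standard Gaussian vector."""
    rng = random.Random(seed)
    p = len(points[0])
    total = 0.0
    for _ in range(trials):
        g = [rng.gauss(0.0, 1.0) for _ in range(p)]
        total += max(sum(gi * zi for gi, zi in zip(g, z)) for z in points)
    return total / trials

# Hypothetical set: the antipodal points +e1 and -e1 on the unit circle.
# Here sup <g, z> = |g_1|, so the exact width is E|g_1| = sqrt(2 / pi).
width = gaussian_width([[1.0, 0.0], [-1.0, 0.0]])
expected = math.sqrt(2.0 / math.pi)
```

For this two-point set the estimate can be compared against the closed form $\sqrt{2/\pi}$; for the tangent cones considered in the paper no such closed form is available, which is why the width is bounded analytically instead.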



Gordon used the Gaussian width to provide bounds on the probability that a random subspace of
a certain dimension misses a subset of the sphere [17]. In [6], these results are specialized to the
case of atomic norm recovery. In particular, we will make use of the following:

Proposition 2.3 ([6], Corollary 3.2) Let $\Phi : \mathbb{R}^p \to \mathbb{R}^n$ be a random map with i.i.d. zero-mean
Gaussian entries having variance $1/n$. Further let $\Omega = T_{\mathcal{A}}(x^*) \cap S^{p-1}$ denote the spherical part of
the tangent cone $T_{\mathcal{A}}(x^*)$. Suppose that we have measurements $y = \Phi x^*$, and we solve the convex
program (3). Then $x^*$ is the unique optimum of (3) with high probability provided that

$$n \ge \omega(\Omega)^2 + O(1).$$

To complete our problem setup we will also restate Proposition 3.6 in [6]:

Proposition 2.4 ([6], Proposition 3.6) Let $C$ be any non-empty convex cone in $\mathbb{R}^p$, and let
$g \sim \mathcal{N}(0, I)$ be a Gaussian vector. Then:

$$\omega(C \cap S^{p-1}) \le E_g[\mathrm{dist}(g, C^*)] \tag{11}$$

where $\mathrm{dist}(\cdot, \cdot)$ denotes the Euclidean distance between a point and a set, and $C^*$ is the dual cone
of $C$.

We can then square (11) and use Jensen's inequality to obtain

$$\omega(C \cap S^{p-1})^2 \le E_g[\mathrm{dist}(g, C^*)^2] \tag{12}$$

We note here that the dual cone of the tangent cone is the normal cone, and vice-versa.

Thus, to derive measurement bounds, we only need to calculate the square of the Gaussian
width of the intersection of the tangent cone at $x^*$ with respect to the atomic norm and the unit
sphere. This value can be bounded by the distance of a Gaussian random vector to the normal
cone at the same point, as implied by (12). In the next section, we derive bounds on this quantity.



3 Gaussian Width of the Normal Cone for Unions of Subspaces 

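For generic subspaces $\mathcal{K}$, we have

$$c \in N_{\mathcal{A}}(x^*) \iff \exists \gamma \ge 0 : \langle c, x^* \rangle = \gamma \|x^*\|_{\mathcal{A}}, \quad \|c^i\| = \gamma \text{ if } K_i \in \mathcal{K}^*, \quad \|c^i\| \le \gamma \text{ if } K_i \notin \mathcal{K}^*. \tag{13}$$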

We now prove the main result of this paper, a sufficient number of Gaussian measurements 
needed to recover a signal lying in a union of subspaces: 



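Theorem 3.1 To exactly recover a k-subspace sparse signal decomposed into $M$ subspaces in $\mathbb{R}^p$,

$$C_1 \left(\sqrt{2\log(M-k)} + \sqrt{B}\right)^2 k + 2kB\,\kappa^2(K)$$

i.i.d. Gaussian measurements are sufficient, where $C_1$ is the constant of Section 1.2, which depends
on the relative angles between the subspaces.
To prove this result, we need two lemmas: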



Lemma 3.2 Let $q_1, \ldots, q_L$ be $L$ $\chi$-squared random variables with $d$ degrees of freedom. Then

$$E\left[\max_{1 \le i \le L} q_i\right] \le \left(\sqrt{2\log(L)} + \sqrt{d}\right)^2.$$

We defer the proof to Appendix 9.1.
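
The inequality in Lemma 3.2 can be sanity-checked numerically. The sketch below draws independent $\chi^2_d$ variables and compares the empirical mean of their maximum with the stated bound. The helper names, the trial count, the seed, the choice $L = 50$, $d = 4$ and the use of independent draws are all assumptions for illustration; a simulation of this kind illustrates the lemma but does not prove it.

```python
import math
import random

def mean_max_chi2(L, d, trials=2000, seed=0):
    """Empirical mean of the maximum of L independent chi-squared(d) draws."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        # Each chi-squared(d) draw is a sum of d squared standard Gaussians.
        total += max(
            sum(rng.gauss(0.0, 1.0) ** 2 for _ in range(d)) for _ in range(L)
        )
    return total / trials

def lemma_bound(L, d):
    """Right-hand side of Lemma 3.2: (sqrt(2 log L) + sqrt(d))^2."""
    return (math.sqrt(2.0 * math.log(L)) + math.sqrt(d)) ** 2

L, d = 50, 4  # hypothetical values
empirical = mean_max_chi2(L, d)
bound = lemma_bound(L, d)
```

In the proof of Theorem 3.1 the lemma is applied with $L = M - k$ inactive subspaces and $d$ replaced by the maximum subspace dimension $B$.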



Lemma 3.3 Suppose ueP is supported on some set of groups /C* C fC. Then, 



Ml < VW\ *p(.K*) \M*A- 



We defer the proof of this lemma to appendix 9.2 



Proof [Proof of Theorem 3.1]

Intuition: Note that, from (12), the Gaussian width of the intersection of the tangent cone at
$x^*$ with the unit sphere is bounded above by the expected Euclidean distance between a random
Gaussian vector and the normal cone at $x^*$ (13). We can further bound this distance by the distance
between a random Gaussian vector $g$ and a particular vector $r \in N_{\mathcal{A}}(x^*)$, as shown in (14). We
proceed to construct such a vector $r$ and prove the result.

$$E_g[\mathrm{dist}(g, C^*)^2] \le E_g[\mathrm{dist}(g, r)^2], \quad r \in N_{\mathcal{A}}(x^*) \tag{14}$$

Now, since we assume $\mathrm{span}(\mathcal{K}) = \mathbb{R}^p$, we can write, for any vector $\eta$,

$$\eta = \sum_{i=1}^{M} K_i \eta^i$$

So, we have for $x^*$,

$$x^* = K x^* = K^S x^{*S} + K^{S^c} x^{*S^c}$$

where $S \subset \{1, 2, \ldots, \sum_i d_i\}$ is the set of indices of active coefficients of $x^*$. It is important to note that
$\forall i \in J$, $x^{*i}$ is a sub vector of $x^{*S}$. Also, note here that $K^{S^c} x^{*S^c} = 0$.

Since the normal cone is nonempty, there exists a $v \in N_{\mathcal{A}}(x^*)$ with $\|v\|_{\mathcal{A}}^* = 1$ and $v^i = 0 \ \forall i \notin J$.
Since $v$ is in the normal cone, it will also satisfy $\langle v, x^* \rangle = \|x^*\|_{\mathcal{A}}$. We will use this $v$ in our bound
below.

Suppose $w \sim \mathcal{N}(0, I_p)$ is a vector with i.i.d. Gaussian entries. We then have

$$w = K w = K^S w^S + K^{S^c} w^{S^c}$$

Let $t(w) = \max_{i \notin J} \|w^i\|$.

Since $w = \sum_{i=1}^{M} K_i w^i$, we have $w^i = K_i^T (K K^T)^{-1} w$, giving us $w^i \sim \mathcal{N}(0, K_i^T (K K^T)^{-2} K_i)$.
This means that

$$\|w^i\|^2 \sim \|(K K^T)^{-2}\| \, \chi^2_{d_i} \tag{15}$$

So, $\|w^i\|^2$ is a scaled $\chi^2$ random variable with $d_i$ degrees of freedom. The scaling factor is merely
$\sigma_p^{-4}(K)$.

Let us now construct a vector $r \in N_{\mathcal{A}}(x^*)$. We can write, as for $w$,

$$r = K^S \tilde{r}^S + K^{S^c} \tilde{r}^{S^c}$$

Now let $\tilde{r}^S = t(w) v^S$, and $\tilde{r}^{S^c} = w^{S^c}$.

From (13), and from our definition of $t(w)$, we have $r \in N_{\mathcal{A}}(x^*)$. Referring to (12), we now
consider the expected squared distance between $N_{\mathcal{A}}(x^*)$ and $w$:
consider the expected squared distance between Na(x*) and w: 
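
$$\begin{aligned}
E[\mathrm{dist}(w, C^*)^2] &\le E[\|r - w\|^2] \\
&= E\left[\|K^S \tilde{r}^S + K^{S^c} \tilde{r}^{S^c} - K^S w^S - K^{S^c} w^{S^c}\|^2\right] \\
&\overset{(i)}{=} E\left[\|K^S \tilde{r}^S - K^S w^S\|^2 + \|K^{S^c} \tilde{r}^{S^c} - K^{S^c} w^{S^c}\|^2\right] \\
&= E\left[\|K^S \tilde{r}^S - K^S w^S\|^2\right] \\
&\overset{(ii)}{\le} 2E\left[\|K^S \tilde{r}^S\|^2\right] + 2E\left[\|K^S w^S\|^2\right] \\
&= 2E\left[\|K^S t(w) v^S\|^2\right] + 2E\left[\|K^S w^S\|^2\right] \\
&\overset{(iii)}{=} 2E[t(w)^2] \, \|K^S v^S\|^2 + 2E\left[\|K^S w^S\|^2\right] \\
&\overset{(iv)}{=} 2E[t(w)^2] \, \|K v\|^2 + 2E\left[\|K^S w^S\|^2\right] \\
&= 2E[t(w)^2] \, \|v\|^2 + 2E\left[\|K^S w^S\|^2\right] \\
&\overset{(v)}{\le} C_1 k \left(\sqrt{2\log(M-k)} + \sqrt{B}\right)^2 + 2E\left[\|K^S w^S\|^2\right] \\
&\overset{(vi)}{\le} C_1 k \left(\sqrt{2\log(M-k)} + \sqrt{B}\right)^2 + 2kB\,\kappa^2(K)
\end{aligned}$$

Here $C_1$ is the angle-dependent constant of Section 1.2. Step (i) trivially follows because the indices
in $S$ and $S^c$ are disjoint, (ii) follows from the result $\|a - b\|^2 \le 2(\|a\|^2 + \|b\|^2)$, (iii) follows from
the fact that $v$ is deterministic, (iv) follows from the fact that $v$ is only supported on $S$, and (v) follows
from Lemma 3.2, Lemma 3.3 and (15). Finally, (vi) follows from bounding the last term as shown in
Appendix 9.3 and noting that $|S| \le kB$. ■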



3.1 Remarks 

1. The most important thing to note from our result is that we pay no extra penalty in terms of 
the number of measurements needed when the subspaces overlap. Hence, we term our result 
"universal" . 

2. The kB term in the bound is an upper-bound on the signal sparsity. In the case of highly 
overlapping subspaces, this value may be much larger than the signal sparsity, but such cases 
seldom arise in real-world applications. If the subspace dimensions are vastly different, then it 
is pessimistic to bound the quantity with the maximum dimension B, but this yields a simple 
expression for the measurements needed. It is of course possible to obtain tighter bounds 
using the techniques in our work for cases where the groups are of varying sizes. 



3. It can be seen from Theorem 3.1 that the number of measurements is linear in k and B.
Hence, the number of measurements that are sufficient for signal recovery grows linearly with
the number of active subspaces in the signal, and also the maximum subspace dimension.
This can be seen as analogous to the linear dependence of the lasso bound on the sparsity s of
the signal.

4. We note that although we pay no extra price to measure the signal when there is significant
overlap between subspaces, there is an additional cost in the recovery process of the signal, in
that the subspaces need to first be separated by replication of the coefficients [20], or one must
resort to a primal-dual method to solve the problem [27].

5. In the bound we get, $\sigma_p(K)$ captures the price we pay if the subspaces are non perpendicular.
Indeed, if subspaces are nearly aligned with each other, then it becomes nearly impossible
to distinguish between them. This is reflected by the fact that $\sigma_p(K) \to 0$ as the subspaces
become more aligned. Fig. 3(a) shows this phenomenon for the case of two subspaces of one
dimension each that become more and more aligned with each other. Similarly, in the second
term, the condition number $\kappa(K)$ of the matrix $K$ is determined by the angle between
subspaces. The closer the subspaces are to each other, the higher the condition number,
and subsequently the more measurements we need. This can be seen from Fig. 3(b).

(a) $\sigma_p(K)$ as the angle between 2 subspaces is varied   (b) $\kappa(K)$ as the angle between 2 subspaces is varied

Figure 3: As the two subspaces become more separated ($\theta \to 90°$), both the quantities approach
1. As $\theta \to 0°$, $\sigma_p(K) \to 0$ and $\kappa(K) \to \infty$, indicating that it becomes impossible to distinguish
between active subspaces.
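
The behaviour described for Fig. 3 can be reproduced with a short sketch. It assumes the simplest setting: two one-dimensional subspaces of $\mathbb{R}^2$ spanned by unit vectors at angle $\theta$, so that $K$ is a $2 \times 2$ matrix whose Gram matrix has eigenvalues $1 \pm |\cos\theta|$. The function name and the sampled angles are hypothetical, and the closed form used here applies only to this two-vector case.

```python
import math

def sigma_min_and_cond(theta_deg):
    """Smallest singular value and condition number of K = [u1 u2], where
    u1 and u2 are unit vectors in R^2 separated by theta_deg degrees.
    Valid for 0 < theta_deg <= 90 (theta_deg = 0 makes K singular)."""
    c = abs(math.cos(math.radians(theta_deg)))
    sigma_max = math.sqrt(1.0 + c)  # sqrt of largest eigenvalue of K^T K
    sigma_min = math.sqrt(1.0 - c)  # sqrt of smallest eigenvalue of K^T K
    return sigma_min, sigma_max / sigma_min

for theta in (90.0, 45.0, 10.0, 1.0):
    s, kappa = sigma_min_and_cond(theta)
    print(f"theta={theta:5.1f} deg  sigma_min={s:.4f}  kappa={kappa:.2f}")
```

As the angle shrinks, the smallest singular value decays toward zero and the condition number grows without bound, matching the qualitative trend in the figure caption.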



3.2 Perpendicular Subspaces 

When the subspaces are perpendicular to each other, as defined in Section 2.1, we can make the bound we
obtained much tighter. To see this, note that when the subspaces are perpendicular, equation (ii)
in the proof of Theorem 3.1 can be replaced by

$$E[\|K^S \tilde{r}^S\|^2] + E[\|K^S w^S\|^2]$$

This follows since, in the case of perpendicular subspaces, $\tilde{r}^S$ is independent of $w^S$. Also, since the
subspaces are perpendicular, we have

$$\sigma_p(K^*) = \sigma_p(K) = \kappa(K) = 1$$

Substituting the values in the bound we get from Theorem 3.1, we have

$$E[\mathrm{dist}(w, C^*)^2] \le k\left(\sqrt{2\log(M-k)} + \sqrt{B}\right)^2 + kB \tag{16}$$

for perpendicular subspaces. This is a much smaller quantity than the general result we obtained
in Theorem 3.1, underscoring the fact that recovery becomes easier as the subspaces get more and
more separated. In the next section, we show that the group lasso with overlapping groups (also
called the latent group lasso) is a special case of recovery when subspaces are perpendicular.



4 The Group Lasso with Overlapping Groups 



The group lasso with overlapping groups [20, 30, 32] can be formulated as an atomic norm
minimization problem. In the group lasso problem, we are given a set of groups $\mathcal{G} = \{G_1, G_2, \ldots, G_M\}$,
$G_i \subset \{1, 2, \ldots, p\}$, and we wish to recover a k-group sparse vector $x^*$ from compressive measurements.
In this case, we can define the subspaces $K_i \in \mathcal{K}$ as follows:

$$\forall i \in \{1, 2, \ldots, M\}, \quad K_i = I^{G_i} \tag{17}$$

where $I^{G_i}$ is the sub matrix of the identity matrix, consisting of columns indexed by the group $G_i$.

We now show that minimizing the atomic norm under the atomic set arising out of these
subspaces yields the group lasso with overlapping groups. Note that, under the definition of the
subspaces as in (17), and referring to (5), we have that $\mathcal{A}_i$ is merely the unit sphere restricted to
the dimensions indexed by group $G_i$.

Lemma 4.1 Suppose the atomic set is given as in (5). Let $K_i = I^{G_i} \ \forall i \in \{1, 2, \ldots, M\}$, where
$G_i \subset \{1, 2, \ldots, p\}$. Then,

$$\|x\|_{\mathcal{A}} = \Omega^{\mathcal{G}}_{\text{overlap}}(x)$$

where $\Omega^{\mathcal{G}}_{\text{overlap}}(x)$ is the overlapping group lasso norm defined in [20].

Proof In (2), we can substitute $v_G = c_G a$, giving us $c_G = |c_G| \cdot \|a\| = \|c_G a\| = \|v_G\|$. Hence,

$$\begin{aligned}
\|x\|_{\mathcal{A}} &= \inf\left\{ \sum_{a \in \mathcal{A}} c_a : x = \sum_{a \in \mathcal{A}} c_a a, \ c_a \ge 0 \ \forall a \in \mathcal{A} \right\} \\
&= \inf\left\{ \sum_{G \in \mathcal{G}} \|v_G\| : x = \sum_{G \in \mathcal{G}} v_G \right\} \\
&= \Omega^{\mathcal{G}}_{\text{overlap}}(x)
\end{aligned}$$
Corollary 4.2 Under the atomic set defined in (5), when $K_i = I^{G_i} \ \forall i \in \{1, 2, \ldots, M\}$ and the
groups do not overlap,

$$\|x\|_{\mathcal{A}} = \sum_{G \in \mathcal{G}} \|x_G\|$$

Proof $\Omega^{\mathcal{G}}_{\text{overlap}}(x) = \sum_{G \in \mathcal{G}} \|x_G\|$ in the non overlapping case.
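
For disjoint groups, the norm in Corollary 4.2 is straightforward to evaluate. The following sketch does so for a small vector; the function name `group_norm`, the 0-based index convention and the example data are assumptions for illustration. It deliberately rejects overlapping groups, since in that case the latent group lasso norm requires solving an optimization problem rather than summing sub-vector norms.

```python
import math

def group_norm(x, groups):
    """Sum of Euclidean norms of the sub-vectors x_G over disjoint groups.
    Raises ValueError if any two groups share an index."""
    seen = set()
    for g in groups:
        if seen & set(g):
            raise ValueError("groups overlap; this formula no longer applies")
        seen |= set(g)
    return sum(math.sqrt(sum(x[i] ** 2 for i in g)) for g in groups)

# Hypothetical example: six coefficients in three disjoint groups (0-based).
x = [3.0, 4.0, 0.0, 0.0, 5.0, 12.0]
groups = [[0, 1], [2, 3], [4, 5]]
value = group_norm(x, groups)  # ||(3, 4)|| + ||(0, 0)|| + ||(5, 12)||
```

The middle group contributes nothing to the sum, which is the group-level analogue of a zero coefficient in the $\ell_1$ norm.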



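Thus, (3) yields:

$$\hat{x} = \arg\min_x \Omega^{\mathcal{G}}_{\text{overlap}}(x) \quad \text{s.t.} \quad y = \Phi x \tag{18}$$

which can be solved using [20].

It is not hard to see that, in the case of disjoint groups,

$$N_{\mathcal{A}}(x^*) = \left\{ z \in \mathbb{R}^p : z_G = \gamma \frac{x^*_G}{\|x^*_G\|} \ \forall G \in \mathcal{G}^*, \ \|z_G\| \le \gamma \ \forall G \notin \mathcal{G}^*, \ \gamma \ge 0 \right\} \tag{19}$$

However, in the case of overlapping groups, as in the case of generic subspaces, no such closed form
exists.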

Under the group sparsity model defined above, it is not hard to see that $\sigma_p(K^*) = \kappa(K) = 1$.
Also, note that the subspaces are perpendicular to each other since the basis vectors of each subspace
are aligned with one or more of the coordinate axes in $\mathbb{R}^p$. As a consequence, we obtain the following
result:

Theorem 4.3 To exactly recover a k-group sparse signal decomposed into $M$ groups in $\mathbb{R}^p$, the
following is a sufficient number of Gaussian measurements needed:

$$\left(\sqrt{2\log(M-k)} + \sqrt{B}\right)^2 k + kB$$

4.1 Remark: Comparison with the lasso

We compare the group lasso bound we obtain to the standard lasso measurement bound:

$$(2s + 1)\log(p - s) \tag{20}$$

The bound we obtain in Theorem 4.3 can be upper bounded by

$$2k \max\{2\log(M), B\} + kB \tag{21}$$

Note that $s \le kB$, with equality when the groups do not overlap. In this case, (21) evaluates to

$$\frac{2s}{B} \max\{2\log(M), B\} + s \approx \frac{2s + 1}{B} \max\{2\log(M), B\}$$

which is smaller than the lasso bound (20) by a factor of roughly $\frac{\log(M)}{B\log(p)}$. So, in most cases, our
bound shows that we can perform better than the conventional lasso by exploiting the additional
group structured information that is available.
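
To get a feel for the comparison, the sketch below evaluates the expression from Theorem 4.3 and the lasso bound (20) for one made-up problem size. The function names and the values of $p$, $M$, $B$ and $k$ are hypothetical; the groups are taken to be disjoint so that $s = kB$, and $M - k > 1$ is assumed so that the logarithm is positive.

```python
import math

def group_lasso_bound(k, M, B):
    """Expression from Theorem 4.3: (sqrt(2 log(M - k)) + sqrt(B))^2 k + kB."""
    return k * (math.sqrt(2.0 * math.log(M - k)) + math.sqrt(B)) ** 2 + k * B

def lasso_bound(s, p):
    """Lasso bound (20): (2s + 1) log(p - s)."""
    return (2 * s + 1) * math.log(p - s)

# Hypothetical problem: p = 10000 coefficients split into M = 1000 disjoint
# groups of size B = 10, of which k = 5 are active, so s = k * B = 50.
p, M, B, k = 10000, 1000, 10, 5
s = k * B
n_group = group_lasso_bound(k, M, B)
n_lasso = lasso_bound(s, p)
```

Both quantities are sufficient-condition bounds, so the gap between them indicates how much the analysis gains from group structure rather than the exact number of measurements either method needs in practice.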
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5 Approximately Sparse Signals 



The result we proved in Theorem 3.1 applies to exact reconstruction of $k$-subspace sparse signals. In many cases, however, the signals are not exactly subspace (or group) sparse, but approximately so. Specifically, in a context we are especially interested in, the ordered wavelet coefficients of natural images decay exponentially to zero, and hence can be modeled as approximately sparse, with "sparsity" meaning that only a few coefficients are of significant magnitude.
In such cases, we can model the approximately sparse signal $f^*$ as

\[ f^* = x^* + h^* \]

where $x^*$ is a $k$-subspace sparse approximation of $f^*$, retaining the $k$ subspaces having largest norm, and $h^*$ corresponds to the remaining coefficients that are small in magnitude. Clearly, we can bound $\|h^*\|$ above by some constant, say $C_h$.

Now, measuring the approximately sparse signal $f^*$ using a Gaussian measurement matrix amounts to

\[ y = \Phi f^* = \Phi x^* + \Phi h^* = \Phi x^* + \theta \]

Since the norm of $\Phi$ is bounded, we can write

\[ \|\theta\| \le \delta \]

The results we have obtained thus far can be easily extended to the case where we obtain such 
bounded noisy observations. In the noisy case, we observe 

\[ y = \Phi x^* + \theta, \quad \|\theta\| \le \delta \]

We then solve the atomic norm minimization problem, with a relaxed constraint to take into account 
the bounded noise: 

\[ \hat{x} = \arg\min_{x \in \mathbb{R}^p} \|x\|_A \quad \text{s.t.} \quad \|y - \Phi x\| \le \delta \tag{22} \]

We restate Corollary 3.3 from [6]:

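Proposition 5.1 ([6], Corollary 3.3) Let $\Phi : \mathbb{R}^p \to \mathbb{R}^n$ be a random map with i.i.d. zero-mean Gaussian entries having variance $1/n$. Further let $\Omega = T_A(x^*) \cap S^{p-1}$ denote the spherical part of the tangent cone $T_A(x^*)$. Suppose that we have measurements $y = \Phi x^* + \theta$, with $\|\theta\| \le \delta$. Suppose we solve the convex program (22), and let $\hat{x}$ denote the optimum of (22). Also, suppose $\|\Phi z\| \ge \epsilon\|z\|$ for all $z \in T_A(x^*)$. Then $\|x^* - \hat{x}\| \le \frac{2\delta}{\epsilon}$ with high probability provided that

\[ n \ge \frac{w(\Omega)^2 + \frac{3}{2}}{(1 - \epsilon)^2}, \]

where $w(\Omega)$ denotes the Gaussian width of $\Omega$.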



Substituting the result of Theorem 3.1 in Proposition 5.1, we have the following corollary 
yielding a sufficient condition to accurately recover a signal when the measurements are corrupted 
with bounded noise: 
Corollary 5.2 Suppose we wish to recover a signal that lies in $k$ out of $M$ arbitrarily defined subspaces, such that the maximum subspace dimension is $B$. Let the set of active subspaces be denoted by $\mathcal{K}^*$. Let $\hat{x}$ be the optimum of the convex program (22). To have $\|\hat{x} - x^*\| \le \frac{2\delta}{\epsilon}$ with high probability,

\[ \frac{2\sigma_p^2(\mathcal{K}^*)\left(\sqrt{2\log(M - k)} + \sqrt{B}\right)^2 k + 2kB\,\kappa^2(K)}{(1 - \epsilon)^2} \]

i.i.d. Gaussian measurements are sufficient.



Note that we merely need to set $\sigma_p(\mathcal{K}^*) = \kappa(K) = 1$, and remove the '2' from both terms, to obtain the corresponding result for the latent group lasso:

\[ \left(\sqrt{2\log(M - k)} + \sqrt{B}\right)^2 k + kB \]

6 Compressive Imaging with Group Sparsity 

We consider the compressive imaging problem, that is, to recover an image from a small number of random measurements. Here "small" is used relative to the ambient dimension of the image. The standard lasso [39] formulation is given by

\[ \hat{x} = \arg\min_{x} \frac{1}{2}\|y - \Phi x\|^2 + \lambda\|x\|_1 \tag{23} \]

The $\ell_1$ norm acts as a surrogate for the sparsity of the signal. The lasso aims to recover a signal that is sparse, by setting most coefficients of $x$ to zero. For the exact recovery case, the lasso problem is equivalent to Basis Pursuit [7]

\[ \hat{x} = \arg\min_{x} \|x\|_1 \quad \text{s.t.} \quad y = \Phi x \tag{24} \]

The lasso penalty reflects the fact that the wavelet coefficients are approximately sparse, but in reality not all patterns of sparsity are equally plausible. For example, Fig. 4(b) shows the DWT coefficients of the Barbara image, and Fig. 4(c) shows the same coefficients, but randomly scrambled. Clearly, the $\ell_1$ norm of both sets of coefficients will be the same. This shows that the lasso penalty in itself is invariant to any structure present in the sparse coefficients.
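The invariance is easy to check numerically. The following is a minimal sketch on a synthetic vector rather than on actual DWT coefficients; the sizes, the disjoint grouping, and the helper names are all assumptions made for illustration. It scrambles a group sparse vector and compares the $\ell_1$ norm with two structure-aware quantities.

```python
import numpy as np

rng = np.random.default_rng(0)
p, B = 64, 4
# Hypothetical disjoint groups of B consecutive indices.
groups = [np.arange(g, g + B) for g in range(0, p, B)]

# A vector with exactly two active groups.
x = np.zeros(p)
for g in (2, 9):
    x[groups[g]] = rng.uniform(0.5, 1.0, B)

# The same coefficients in random order.
x_scrambled = rng.permutation(x)

def l1(v):
    return np.abs(v).sum()

def group_norm(v, groups):
    """Sum of Euclidean norms over the groups (group lasso penalty)."""
    return sum(np.linalg.norm(v[g]) for g in groups)

def active_groups(v, groups):
    """Number of groups containing at least one nonzero coefficient."""
    return sum(bool(np.any(v[g] != 0)) for g in groups)

print("l1:           ", l1(x), l1(x_scrambled))
print("group norm:   ", group_norm(x, groups), group_norm(x_scrambled, groups))
print("active groups:", active_groups(x, groups), active_groups(x_scrambled, groups))
```

The $\ell_1$ values agree up to floating point error, whereas the group-based quantities generally change once the coefficients are scrambled.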

To model the structure that is inherently present between wavelet transform coefficients of images, [8, 13, 34] propose making use of graphical models such as Hidden Markov Trees (HMTs). HMTs, while providing good performance in image denoising applications (where $\Phi = I$ in (23)), cannot provide acceptable reconstruction for other, more general inverse problems. This is because the presence of a (non-identity) sensing matrix $\Phi$ (randomly) mixes up the coefficients for every measurement $y_i$ obtained.

To overcome this mixing between the coefficients, many alternatives have been proposed. [35] propose using a version of loopy belief propagation to solve the recovery problem. The authors in [2, 12] generalize the notion of restricted isometry properties to signals that lie in unions of subspaces, and use a modified version of CoSaMP [28] to solve the inverse problem. Greedy and/or suboptimal iterative reconstruction schemes are used in [13, 23]. Finally, the authors in [36] propose modeling the coefficients using an HMT, and using the Approximate Message Passing algorithm [9] to solve the compressed sensing problem.
Figure 4: (a) Original image. (b) 3-stage DWT of the Barbara image. (c) Coefficients of the DWT of the Barbara image, randomized. The $\ell_1$ norms of both (b) and (c) are exactly equal, since they do not take structure into account.



All the works mentioned above sacrifice the recovery guarantees and the easy analysis that convex optimization algorithms provide, for the sake of modeling the dependencies between DWT coefficients. This motivates our problem: can we model the dependencies among wavelet transform coefficients, while at the same time solving a convex optimization problem similar to (23)?



To this end, we model parent-child coefficient pairs as groups. Parent-child pairs of wavelet transform coefficients across scales and at similar locations tend to be simultaneously high or low. Hence, we can take advantage of this dependency and use group lasso methods to recover the image. Fig. 5 shows a representative example.




Figure 5: Quadtree corresponding to the 2-d DWT. At each scale, parent coefficients can be grouped 
with child coefficients. 



We wish to recover the nonzero coefficients lying on the wavelet tree shown in Fig. 5. When coefficients are modeled into groups, one can use the group lasso [42] to recover the coefficients

\[ \hat{x} = \arg\min_{x} \frac{1}{2}\|y - \Phi x\|^2 + \lambda \sum_{i=1}^{M} \|x_{G_i}\| \tag{25} \]

where $x_{G_i}$ is the vector $x$ whose coefficients not indexed by group $G_i$ are set to zero. The group lasso as shown in (25) suffers from a drawback, however. It was recently argued in [20] that the sparsity pattern recovered by the group lasso can be expressed as a complement of a union of groups. One look at Fig. 5 tells us that we are interested in the recovery of sparsity patterns that can be expressed as a union of (overlapping) groups. To this end, the authors in [20] propose the latent group lasso [30, 32]

\[ \hat{x} = \arg\min_{x} \frac{1}{2}\|y - \Phi x\|^2 + \lambda\,\Omega_{\text{overlap}}(x) \tag{26} \]

where $\Omega_{\text{overlap}}(x)$ is the latent group lasso norm.

The latent group lasso lends itself well to the sort of problems we are concerned about in this 
paper. For a thorough analysis of the various properties of the latent group lasso penalty, we 
refer the interested reader to [30] . In the next section, we show how the latent group lasso can 
be effectively used to recover images, when we model the DWT coefficients to lie in parent-child 
groups. 



7 Experiments 

In this section we aim to show two things: 



1. The bound we derived in Theorems 3.1 and 4.3 holds for a wide variety of cases, and is invariant to the grouping observed in the signal.

2. By modeling the DWT coefficients of images (and 1D signals, of course) into parent-child groups, we can recover the signal efficiently and exactly/robustly.

Also, henceforth we refer to the latent group lasso method as Glasso. 



7.1 Sampling Bounds for Subspace Sparse Signals 

We extensively tested our method against the standard lasso procedure. In the case where the groups overlap, we use the replication method outlined in [20] to reduce the optimization problem to that of non-overlapping groups. We compare the number of measurements needed for our method with that needed for the lasso. For the lasso, it is instructive to keep in mind the bound derived in [6], viz. $(2s + 1)\log(p - s)$. In the case of non-overlapping groups, the bound evaluates to $(2kB + 1)\log(MB - kB)$. We generate length $p = 2000$ signals, made up of $M = 100$ non-overlapping groups of size $B = 20$. We set $k = 5$ groups to be "active", and the values within the groups are drawn from a uniform $[0, 1]$ distribution. The active groups are assigned uniformly at random. The sparsity of the signal will thus be $s = 100$.
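The replication idea can be sketched in a few lines: columns of $\Phi$ that belong to several groups are duplicated so that the groups become disjoint blocks, an ordinary group lasso is solved in the replicated space, and the latent components are summed to obtain $x$. The code below is a minimal illustration under stated assumptions, not the authors' implementation: the function names are hypothetical, and a plain proximal gradient iteration stands in for SpaRSA.

```python
import numpy as np

def replicate(Phi, groups):
    """Duplicate columns of Phi so that overlapping groups become disjoint blocks."""
    cols = np.concatenate(groups)          # original column index of each replica
    blocks, start = [], 0
    for g in groups:
        blocks.append(np.arange(start, start + len(g)))
        start += len(g)
    return Phi[:, cols], cols, blocks

def collapse(v, cols, p):
    """Map latent (replicated) coefficients back to R^p by summing the replicas."""
    x = np.zeros(p)
    np.add.at(x, cols, v)                  # accumulates correctly over repeated indices
    return x

def group_soft_threshold(v, blocks, tau):
    """Proximal operator of tau * sum_g ||v_g|| for disjoint blocks."""
    out = np.zeros_like(v)
    for b in blocks:
        nrm = np.linalg.norm(v[b])
        if nrm > tau:
            out[b] = (1.0 - tau / nrm) * v[b]
    return out

def latent_group_lasso(Phi, y, groups, lam, n_iter=500):
    """Proximal gradient on the replicated problem (a stand-in for SpaRSA)."""
    Phi_rep, cols, blocks = replicate(Phi, groups)
    step = 1.0 / np.linalg.norm(Phi_rep, 2) ** 2   # 1 / Lipschitz constant of the gradient
    v = np.zeros(Phi_rep.shape[1])
    for _ in range(n_iter):
        grad = Phi_rep.T @ (Phi_rep @ v - y)
        v = group_soft_threshold(v - step * grad, blocks, step * lam)
    return collapse(v, cols, Phi.shape[1])
```

A call such as `latent_group_lasso(Phi, y, groups, lam)` with `groups` given as a list of index arrays returns an estimate in the original coordinates; the regularization parameter would still need to be tuned, for instance over a grid.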

We use SpaRSA [41] for the lasso and the group lasso with overlap, learning $\lambda$ over a grid. Fig. 6 displays the mean reconstruction error $\|\hat{x} - x^*\|/p$ as a function of the number of random measurements taken. The errors have been averaged over 100 tests, and each time a new random signal was generated with the above mentioned parameters.

For the parameters considered, the bound of Theorem 4.3 evaluates to $\approx 380$ measurements. When we have 380 measurements, the lasso does not recover the signal exactly, as seen in Fig. 6, but the latent group lasso does.

Figure 6: Mean reconstruction error versus number of measurements. The group lasso (red) compared with the lasso (blue). The vertical line indicates our bound. Note that our bound (380) predicts exact recovery of the signal, while at the same number of measurements, the lasso does not recover the signal.

To show that the bound we compute holds regardless of the complexity of groupings, we consider the following scenario: Suppose we have $M = 100$ groups, each of size $B = 40$. $k = 5$ of those groups are active, and the values within each group are drawn from a uniform $[-1, 1]$ distribution. We arrange these groups in three configurations:

1. The groups do not overlap, yielding a signal of length $p = 4000$, and signal sparsity $s = 200$.

2. A partial overlapping scenario, where apart from the first and last group, every group has 20 elements in common with the group above it, and 20 in common with the group below, giving $p = 2020$ and $s \in [120, 200]$ depending on which of the 100 groups are active.

3. A random overlap case where the first 50 groups are non-overlapping and the remaining 50 are assigned uniformly at random from the existing $p = 2000$ indices; $s \le 200$ in this case.
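The signal lengths and sparsity ranges quoted for these configurations follow from simple index bookkeeping, which can be reproduced as below. This is a sketch only: the helper names are hypothetical, and for the random overlap case it is assumed that each random group draws 40 distinct indices.

```python
import numpy as np

M, B, k = 100, 40, 5

def disjoint_groups(M, B):
    """Configuration 1: consecutive, non-overlapping blocks."""
    return [np.arange(i * B, (i + 1) * B) for i in range(M)]

def shifted_groups(M, B, shift):
    """Configuration 2: each group starts `shift` indices after the previous one."""
    return [np.arange(i * shift, i * shift + B) for i in range(M)]

def support_size(groups, active):
    """Signal sparsity s: number of distinct indices covered by the active groups."""
    return len(set(np.concatenate([groups[i] for i in active]).tolist()))

g1 = disjoint_groups(M, B)
g2 = shifted_groups(M, B, shift=B // 2)

# Configuration 3 (assumption: random groups hold B distinct indices out of 2000).
rng = np.random.default_rng(0)
g3 = disjoint_groups(50, B) + [rng.choice(2000, B, replace=False) for _ in range(50)]

p1 = 1 + max(int(g.max()) for g in g1)            # 4000
p2 = 1 + max(int(g.max()) for g in g2)            # 2020
s_min = support_size(g2, [0, 1, 2, 3, 4])         # adjacent active groups: 120
s_max = support_size(g2, [0, 10, 20, 30, 40])     # well separated active groups: 200
print(p1, p2, s_min, s_max)
```

Adjacent active groups in the partial overlap case share indices and give the smallest support, while active groups that do not touch give the largest.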

The scenarios we consider are depicted in Fig. 7. In each of the cases, we compute the bound to be $\approx 630$. The bound becomes looser as the complexity of the groupings increases. This, as argued before, is a result of the bound for the signal sparsity becoming looser.

Figure 7: Types of groupings considered. Each set of coefficients encompassed by one color belongs to one group.
We can see from Fig. 8(a) that our group lasso bound ($\approx 630$) holds for all cases. For the sake of comparison, we considered the lasso performance on the signals in cases (1) - (3) as well, and these are plotted in Fig. 8(b). From the values of $p$ and $s$ computed for the three cases, we have the corresponding bounds for the lasso [6] to be 3305 for the no overlap case (1), [1819, 3010] for the partial overlap case (2) and 3000 for case (3).

Figure 8: (Best seen in color) Performance on various grouping schemes, plotted against the number of measurements. (a) Performance of the group lasso on the cases considered in Figure 7. Note that our bound evaluates to 630, clearly sufficient measurements to recover the signal in all cases. (b) Performance of the lasso on the cases considered in Figure 7. The group lasso outperforms the lasso in all cases.



We consider exact recovery of the wavelet transform coefficients of the "blocks" signal (Fig. 9(a)). We group the wavelet transform coefficients into parent-child pairs as outlined in Section 6. In this case, for a $p = 16384$ length signal, we have $M = 16382$ groups, and the maximum group size is $B = 2$. We use the Haar wavelet basis to decompose the signal. Fig. 9(b) shows the reconstruction obtained from 1690 measurements, corresponding to the bound computed for $k = 47$. Of course, we can compute $k$ since we have the original signal with us. We see that our bound yields a sufficient number of measurements for exact recovery.
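One way to arrive at this group count is to pair every detail coefficient with each of its two children in a binary wavelet tree. The sketch below assumes a particular indexing convention that is not spelled out in the text: index 0 holds the scaling coefficient and is left ungrouped, and detail coefficient $i \ge 1$ has children $2i$ and $2i + 1$. The function name is hypothetical.

```python
def parent_child_groups(p):
    """Parent-child pairs for a binary wavelet tree over indices 1..p-1.

    Assumed convention: index 0 is the scaling coefficient (ungrouped),
    and coefficient i has children 2i and 2i + 1.
    """
    groups = []
    for parent in range(1, p // 2):
        groups.append((parent, 2 * parent))
        groups.append((parent, 2 * parent + 1))
    return groups

groups = parent_child_groups(16384)
print(len(groups))                      # number of groups M
print(max(len(g) for g in groups))      # maximum group size B
```

Under this convention a length 16384 signal yields 16382 groups of size 2, which agrees with the values of $M$ and $B$ stated above. Every detail coefficient other than the root and the finest-scale leaves appears in three groups, so the groups overlap.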

Our final experiment outlines the relationship between the number of measurements taken and the size of the problem. We generated test signals that were group sparse, with each active group having coefficients selected randomly from a uniform $U[-1, 1]$ distribution. We fix the group size $B$ to be 6. We consider two cases:

• The non-overlapping case (1), and

• The partial overlapping case (2)

Fig. 10 shows the probability of error as the number of measurements increases. In the figure, note that we show the total number of groups ($M$) in the signal. For each $M$, we fix the group sparsity level $k$ to be $M/10$. The results are averaged over 100 tests, and the probability of error is computed empirically. It can be seen in Fig. 10 that, regardless of whether the groups overlap or not, we need roughly the same number of measurements to achieve a low probability of error.
Figure 9: Exact reconstruction of a length 16384 signal from 1690 measurements in the wavelet domain. (a) Original signal. (b) Reconstruction.

Figure 10: Number of groups vs. number of measurements for exact signal reconstruction. The color bar indicates the probability of error, computed empirically. Notice how the number of measurements needed to achieve low probability of error is nearly the same in both cases, highlighting that we indeed do not pay a penalty for complicated grouping strategies.



7.2 Modeling DWT coefficients into groups 



In the spirit of [2, 36], we considered a 128 × 128 section of the cameraman image, and obtained 5000 i.i.d. Gaussian measurements from it. No noise was added to the image. We compare our methods with the ones displayed in [36]. Fig. 11(a) has been taken directly from [36], and our result is shown in Fig. 11(b).

For the basis of comparison, we zoom into similar regions from the best performing methods in Fig. 11(a) and Fig. 11(b), in Fig. 12. It can be seen that our method performs comparably to the turbo-BG and turbo-GM methods.

Figure 11: Reconstruction of a section of the cameraman image using various methods. (a) Reconstruction performance of various methods. (b) Reconstruction using Glasso.

Along similar lines, we tested our methods for noiseless image recovery using the Microsoft Research Object Class Recognition database¹. The dataset consists of images categorized into 20 types, with roughly 30 images of each type. We used the first 10 images of each type to generate a training set of 200 images, which was used to learn the regularization parameters. To compare our results with the other methods tested in [36] (Fig. 6), we compute the Normalized Mean Square Error (NMSE) of our methods over the same dataset. The NMSE (in dB) for the true image $x$ and its estimate $\hat{x}$ is given by $10\log_{10}\left(\|x - \hat{x}\|^2 / \|x\|^2\right)$. We resized the images to size 128 × 128, and obtained 5000 measurements for each case.

¹ http://research.microsoft.com/en-us/projects/ObjectClassRecognition
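For reference, the NMSE in dB can be computed as in the short sketch below. It assumes the usual definition, the squared error energy divided by the energy of the true image, and the function name is hypothetical.

```python
import numpy as np

def nmse_db(x_true, x_hat):
    """NMSE in dB, assuming the usual definition ||x - x_hat||^2 / ||x||^2."""
    x_true = np.asarray(x_true, dtype=float)
    x_hat = np.asarray(x_hat, dtype=float)
    err = np.sum((x_true - x_hat) ** 2)
    return 10.0 * np.log10(err / np.sum(x_true ** 2))

# An estimate that is off by 10% in every pixel has NMSE 10*log10(0.01) = -20 dB.
x = np.linspace(1.0, 2.0, 128 * 128).reshape(128, 128)
print(nmse_db(x, 0.9 * x))
```

Lower (more negative) values indicate better reconstructions.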




Figure 13: Comparison of various methods (CoSaMP, ModelCS, BG-AMP, SPGL1, VB, MCMC, Turbo-BG, Turbo-GM) and Glasso, by image type. Note that the lower the value of NMSE, the better the performance.




Figure 14: Comparison of the two methods in the presence of noise. 



Fig. 14 shows the results we obtain as a function of the noise standard deviation. For the purpose of the experiment, we consider piecewise constant signals of length 1024, having 5 jumps. The location of the jumps is chosen at random, and the magnitude of each "piece" is chosen uniformly from $[-1, 1]$. We take 256 measurements for both the lasso and group lasso. From the figure, it is clear that by modeling the wavelet coefficients into parent-child pairs, we can better reconstruct signals in the presence of noise. The results are averaged over 1000 randomly generated signals.
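To see why such signals suit the parent-child model, the sketch below generates one piecewise constant signal with the stated parameters and computes a full Haar decomposition with a hand-written orthonormal transform. The function names and the coarse-to-fine coefficient ordering are assumptions made for illustration; the experiment itself may have used a different wavelet implementation.

```python
import numpy as np

def haar_dwt(x):
    """Full orthonormal Haar decomposition of a length-2^J signal.

    Output order: scaling coefficient first, then details from coarse to fine.
    """
    x = np.asarray(x, dtype=float)
    details = []
    while len(x) > 1:
        avg = (x[0::2] + x[1::2]) / np.sqrt(2)
        diff = (x[0::2] - x[1::2]) / np.sqrt(2)
        details.append(diff)
        x = avg
    return np.concatenate([x] + details[::-1])

def piecewise_constant(n, n_jumps, rng):
    """Length-n signal with n_jumps random jump locations, levels uniform in [-1, 1]."""
    jumps = np.sort(rng.choice(np.arange(1, n), n_jumps, replace=False))
    edges = np.concatenate([[0], jumps, [n]])
    signal = np.empty(n)
    for a, b in zip(edges[:-1], edges[1:]):
        signal[a:b] = rng.uniform(-1, 1)
    return signal

rng = np.random.default_rng(0)
f = piecewise_constant(1024, 5, rng)
w = haar_dwt(f)
n_significant = int(np.sum(np.abs(w) > 1e-10))
print("nonzero Haar coefficients:", n_significant, "out of", len(w))
```

Each jump can affect at most one detail coefficient per scale, so at most $5 \cdot 10 + 1 = 51$ of the 1024 coefficients are nonzero, and the nonzero detail coefficients lie along paths from the root of the wavelet tree to the jump locations.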



The authors thank Subhojit Som and Phil Schniter for sharing data for Fig. 
8 Conclusions and Discussion



In this paper, we showed that one can recover a signal known to lie in a sparse union of subspaces exactly using atomic norm minimization algorithms. The number of measurements needed depends on the number of active subspaces, their dimensions and the relative angles between subspaces. We also showed that the measurement bound we derived is universal, in that it holds regardless of the nature and specific structure of the subspaces, in terms of overlaps. Indeed, the bounds can be specialized to cases where one is interested in a specific type of grouping. We subsequently extended these results to signals lying in a generic union of groups.

We also proposed a novel modeling strategy for DWT coefficients of signals. By modeling the coefficients into parent-child groups, we were able to take advantage of convex recovery methods that provably guarantee signal recovery. We showed that our method is at least as good as the current state of the art in non-adaptive compressed sensing.

We note here that we do not claim the optimality of the particular grouping method that we have used, viz. grouping the parent-child pairs together. How best to group wavelet coefficients is still an open question, and is an avenue for further research. We prefer the parent-child pairs for their simplicity, and because the groupings yield acceptable results, as seen in Section 7.

9 Appendices 

9.1 Proof of Lemma 3.2

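Proof Let $M_L := \max_{1 \le j \le L} q_j$. For $t > 0$, we have that

\begin{align*}
E[M_L] &= \frac{\log\left(\exp(t\,E[M_L])\right)}{t} \\
&\overset{(a)}{\le} \frac{\log E[\exp(t\,M_L)]}{t} \\
&\overset{(b)}{=} \frac{\log E[\max_{1 \le j \le L} \exp(t\,q_j)]}{t} \\
&\overset{(c)}{\le} \frac{\log\left(L\,E[\exp(t\,q_1)]\right)}{t} \\
&= \frac{\log(L) - \frac{d}{2}\log(1 - 2t)}{t}
\end{align*}

where (a) follows from Jensen's inequality, (b) follows from the monotonicity of the exponential function, and (c) merely bounds the maximum by the sum over all the elements. Now, setting $t = [2 + 2\epsilon]^{-1}$ with $\epsilon = \sqrt{\frac{d}{2\log(L)}}$ yields

\[ E[M_L] \le \left(\sqrt{2\log(L)} + \sqrt{d}\right)^2 \]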



Note that t can be optimized depending on the application. We use this particular choice 
because it makes no assumptions about the relative magnitudes of (M — k) and B. 
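The final substitution can be spelled out as follows. This is a sketch of the algebra, assuming the choice $\epsilon = \sqrt{d / (2\log L)}$ and using the elementary inequality $\log(1 + 1/\epsilon) \le 1/\epsilon$; with $t = (2 + 2\epsilon)^{-1}$ we have $1/t = 2 + 2\epsilon$ and $1 - 2t = \epsilon/(1 + \epsilon)$.

```latex
\begin{align*}
\frac{\log L - \frac{d}{2}\log(1 - 2t)}{t}
  &= (2 + 2\epsilon)\left[\log L + \frac{d}{2}\log\left(1 + \frac{1}{\epsilon}\right)\right] \\
  &\le (2 + 2\epsilon)\left[\log L + \frac{d}{2\epsilon}\right] \\
  &= 2\log L + 2\epsilon\log L + \frac{d}{\epsilon} + d \\
  &= 2\log L + 2\sqrt{2d\log L} + d
   = \left(\sqrt{2\log L} + \sqrt{d}\right)^2 .
\end{align*}
```

The last line uses $2\epsilon\log L = d/\epsilon = \sqrt{2d\log L}$ for this value of $\epsilon$, which is the choice that balances the two cross terms.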
9.2 Proof of Lemma 3.3



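Proof Let $\mathcal{K}^* = [K_j]_{j \in \mathcal{J}^*}$. By duality, it suffices to show that $\|z\|_A \le \sqrt{|\mathcal{K}^*|}\,\sigma_p(\mathcal{K}^*)\,\|z\|$ for all $z$ with $\mathrm{supp}(z) \subset \mathcal{K}^*$. For any such $z$, there exists a representation $z = \mathcal{K}^* b$ such that

\begin{align*}
z &= [K_{j_1}\ K_{j_2}\ \ldots\ K_{j_{|\mathcal{J}^*|}}]\,[b^1\ b^2\ \ldots\ b^{|\mathcal{J}^*|}]^T \\
&= \sum_{K_j \in \mathcal{K}^*} K_j b^j
\end{align*}

so that none of the supports of the $b^j$ overlap. It then follows that

\begin{align*}
\|z\|_A &= \Big\| \sum_{K_j \in \mathcal{K}^*} K_j b^j \Big\|_A \\
&\overset{(i)}{\le} \sum_{K_j \in \mathcal{K}^*} \|b^j\| \\
&\overset{(ii)}{\le} \sqrt{|\mathcal{K}^*|} \Big( \sum_{K_j \in \mathcal{K}^*} \|b^j\|^2 \Big)^{1/2} \\
&= \sqrt{|\mathcal{K}^*|}\,\|b\| \\
&\overset{(iii)}{\le} \sqrt{|\mathcal{K}^*|}\,\sigma_p(\mathcal{K}^*)\,\|z\|
\end{align*}

where (i) follows from the definition of the norm $\|\cdot\|_A$, (ii) is a consequence of the relation $\|\beta\|_1 \le \sqrt{k}\,\|\beta\|_2$ for $k$-dimensional vectors $\beta$, and (iii) follows from minimizing $\|b\|$ subject to $z = \mathcal{K}^* b$. ∎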



9.3 Bounding the last term in the proof of Theorem 3.1 
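Proof First, note that we can write

\begin{align*}
E\left[\|K_S w_S\|^2\right] &= E\left[\|K P_S u\|^2\right] \\
&= E\left[\|K P_S K^T (KK^T)^{-1} w\|^2\right]
\end{align*}

where $P_S(\cdot)$ is the operator that projects $(\cdot)$ onto the space spanned by the indices in $S$. Now,

\begin{align*}
E\left[\|K P_S K^T (KK^T)^{-1} w\|^2\right]
&= E\left[w^T (KK^T)^{-1} K P_S^T K^T K P_S K^T (KK^T)^{-1} w\right] \\
&= E\left[\mathrm{tr}\left(K P_S K^T (KK^T)^{-1} w w^T (KK^T)^{-1} K P_S^T K^T\right)\right] \\
&= \mathrm{tr}\left(K P_S K^T (KK^T)^{-1} E[w w^T] (KK^T)^{-1} K P_S^T K^T\right) \\
&\overset{(a)}{=} \mathrm{tr}\left(K P_S K^T (KK^T)^{-1} (KK^T)^{-1} K P_S^T K^T\right) \\
&= \|(KK^T)^{-1} K P_S K^T\|_F^2
\end{align*}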
where (a) is a consequence of $w$ being a standard Gaussian vector. Now, let the singular value decomposition of $K$ be $K = U \Sigma V^T$. Then,

\begin{align*}
\|(KK^T)^{-1} K P_S K^T\|_F^2
&= \|(U \Sigma^2 U^T)^{-1} U \Sigma V^T P_S V \Sigma U^T\|_F^2 \\
&= \|U^{-T} \Sigma^{-2} \Sigma V^T P_S V \Sigma U^T\|_F^2 \\
&\le \|\Sigma^{-1}\|^2\,\|V^T P_S V\|_F^2\,\|\Sigma\|^2 \\
&= \kappa(K)^2\,\|P_S\|_F^2 \\
&= \kappa(K)^2\,|S|
\end{align*}
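The two facts used in this proof can be sanity-checked numerically: for a standard Gaussian $w$, the expectation of $\|K P_S K^T (KK^T)^{-1} w\|^2$ equals a squared Frobenius norm, and that norm is at most $\kappa(K)^2 |S|$. The sketch below uses a small random matrix; the sizes, the index set, and the variable names are arbitrary choices made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 6, 10                       # hypothetical sizes; K has full row rank almost surely
K = rng.standard_normal((n, p))
S = [1, 4, 7]                      # hypothetical index set

# Coordinate projection onto the indices in S.
P_S = np.zeros((p, p))
P_S[S, S] = 1.0

G_inv = np.linalg.inv(K @ K.T)
A = K @ P_S @ K.T @ G_inv          # w -> K P_S K^T (K K^T)^{-1} w

# Squared Frobenius norm appearing at the end of the proof.
fro2 = np.linalg.norm(G_inv @ K @ P_S @ K.T, "fro") ** 2

# Condition number of K from its singular values.
sv = np.linalg.svd(K, compute_uv=False)
kappa = sv.max() / sv.min()

# Monte Carlo estimate of E ||A w||^2 for standard Gaussian w.
W = rng.standard_normal((n, 20000))
mc = np.mean(np.sum((A @ W) ** 2, axis=0))

print("Monte Carlo E||Aw||^2:", mc)
print("Frobenius norm^2:     ", fro2)
print("kappa(K)^2 |S|:       ", kappa ** 2 * len(S))
```

The Monte Carlo estimate should be close to the Frobenius value, and both sit below $\kappa(K)^2 |S|$.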